Abstract

The paper considers various formalisms based on automata, temporal logic,
and regular expressions for specifying queries over finite sequences.
Unlike traditional semantics, which associate a true or false value
denoting whether a sequence satisfies a query, the paper presents
distance measures that associate a value in the interval [0, 1] with a
sequence and a query, denoting how closely the sequence satisfies the
query. These measures are defined using a spectrum of normed vector
distance measures. Such similarity based semantics can be used for
retrieval of database sequences that approximately satisfy a query.
Various measures based on the syntax of the query and on its traditional
semantics are presented, together with efficient algorithms for
computing them.
1 Introduction



Recently there has been much interest in similarity based retrieval from
sequence databases. Work in this area considers a database of sequences
and provides methods for retrieving sequences that approximately match a
given query. Such methods can be applied for retrieval from time-series,
video and textual databases. Earlier work in this area [FRM94] considered
the case when the query is given by a single sequence and developed fast
methods for retrieving all database sequences that closely match the query
sequence. In this paper, we consider the case when the query is given by
a predicate over sequences specified in various formalisms, and give efficient
methods for checking if a database sequence approximately satisfies the query
predicate.

We consider formalisms based on automata, temporal logic, and regular 
expressions for specifying queries over sequences. We define similarity based 
semantics for these formalisms. More specifically, for a database sequence d 
and query q, we define a similarity measure that denotes how closely d satis- 
fies the query q. This measure ranges from zero to one denoting the different 
levels of satisfaction; higher values denote greater levels of satisfaction with 
value one indicating perfect satisfaction. Actually, we define a distance
measure between d and q, and define the similarity measure to be
(1 - distance measure). (Both the distance and the similarity measures
normally take values in the interval [0, 1]. However, if there is no sequence
of the same length as d that satisfies q, then the distance measure takes the
value $\infty$ and the similarity value is $-\infty$.)

For example, in a database consisting of daily stock sequences, one might
pose a query such as: "retrieve all the daily stock patterns in which IBM
stock price remained below 70 until the Dow-Jones value reached 10,000".
Such a query can be expressed in Temporal Logic (or any of the other
formalisms) by considering "IBM stock price is less than 70" and "Dow-Jones
value equals 10,000" as atomic propositions. An answer to such a query
will return not only sequences that exactly satisfy the temporal predicate,
but also those sequences that satisfy it approximately, i.e., those sequences
having a similarity value greater than a given threshold specified by the user.

The distance measures that we define are classified into semantics based
and syntax based measures. We define two types of semantics based distance
measures. In both of these types, the distance measure is defined using the
exact semantics of the query, which is given by a set S of sequences. When
the query q is given by an automaton, S is the set of sequences accepted
by the automaton; when q is a temporal logic formula, S is the set of
(finite) sequences that satisfy the formula; when q is a regular expression,
S is the language specified by the regular expression. The first semantics
based distance measure between d and q is defined to be the minimum of
the vector distances between d and each sequence in the set S. The second
semantic distance measure is more complex and is based on replacing certain
symbols in sequences of S by the wild card symbol (Section 2 contains the
actual definition). By using various normed vector distance functions, we get
a spectrum of the two types of semantic distance measures.

The syntax based distance measure is defined only for the cases when the
query is specified by a temporal formula or by a regular expression. In this
case, the distance measure is defined inductively based on the syntax of the
query, i.e., its value is defined as a function of the distance measures of d
with respect to the top level components of q (i.e., sub-formulas when q is
a temporal logic formula). For example, the syntax distance with respect to
the temporal formula $g \wedge h$ is defined to be the maximum of the distances
with respect to g and h. We relate the syntax and semantics based distance
measures.

We present algorithms for computing the syntactic and semantic distances 
for a given database sequence and a query. For the case when the query is 
given by an automaton or by a regular expression, the algorithms for the 
first semantic distance measure have linear complexity in the size of the 
automaton and polynomial complexity in the length of the sequence (actually, 
the complexity is linear in the length of the sequence for the infinite norm, 
and is quadratic for other cases); for the second semantic distance measure 
the algorithms have the same complexity with respect to the length of the 
sequence, but have triple exponential complexity in terms of the automaton 
size. When the query is given by a temporal logic formula, the algorithms for
the semantic distance measures have the same complexity with the following
exception: the first semantic distance measure has exponential complexity
in the length of the formula; this blow-up is caused by the translation of the
temporal formula into an automaton.

The algorithms for computing syntactic distance measures have linear
complexity in the length of the query and polynomial complexity in the length
of the database sequence (more specifically, the complexity with respect to
the database sequence is linear for the infinite norm vector distance function,
and is quadratic in other cases).

The paper is organized as follows. Section 2 gives definitions and notation. 
Section 3 reviews and presents algorithms for the case when the query is 
specified by automata. Section 4 defines the distance measures for temporal 
logic and presents algorithms for computing these. Section 5 presents the 
corresponding results for the regular expressions. Section 6 briefly discusses
related work. Section 7 contains conclusions. 

2 Definition and Notation 

In this section, we define the various formalisms that we consider in the paper
and their similarity based semantics. For a sequence $s = (s_0, s_1, \ldots, s_{n-1})$,
we let $s[i]$ denote its suffix starting from $s_i$. We let $length(s)$ denote the
length of $s$. A null sequence is a sequence of length zero. If $s, t$ are sequences,
then we let $st$ denote the concatenation of $s$ and $t$ in that order. We represent
a sequence having only one element by that element. A sequence over a set
$A$ is a sequence whose elements are from $A$. A language $L$ over $A$ is a set
of sequences over $A$. We let $A^*$ represent the set of all such sequences. If
$L, M$ are languages over $A$ then $LM$ and $L^*$ are languages over $A$ defined
as follows: $LM = \{\alpha\beta : \alpha \in L, \beta \in M\}$; $L^* = \bigcup_{i \ge 0} L^i$; here $L^i$ is the
concatenation of $L$ with itself $i$ times; $L^0$ is the singleton set containing the
empty string. For a language $L$ over $A$, we let $\bar{L}$ be the complement of $L$, i.e.,
$\bar{L} = A^* - L$. Whenever there is no confusion, we represent a set containing
a single element by that element itself.

Let $A$ be a set of elements, called atomic queries. Each member of $A$
represents an atomic query on a database state. With each database state $u$
and atomic query $a$, we associate a similarity value $simval(u, a)$ that denotes
how closely $u$ satisfies $a$. This value can be any value between zero and one
(one indicates perfect satisfaction). We use a special atomic query $\phi \notin A$,
called wild card, which is always satisfied in every database state; that is,
$simval(u, \phi) = 1$ for every database state $u$. Let $d = (d_0, d_1, \ldots, d_{n-1})$ be
a sequence of database states and $a = (a_0, \ldots, a_{n-1})$ be a sequence over
$A \cup \{\phi\}$. Corresponding to $d$ and $a$, we define a sequence of real numbers
called $simvec(d, a)$ as follows. Let $i_0 < i_1 < \cdots < i_{m-1}$ be all values of
$j$ such that $0 \le j < n$ and $a_j \neq \phi$. We define $simvec(d, a)$ to be the sequence
$(x_0, \ldots, x_{m-1})$ where $x_j = simval(d_{i_j}, a_{i_j})$ for $0 \le j < m$. Intuitively, we
define $simvec(d, a)$ by ignoring the positions corresponding to the wild card
symbol. Note that if $a$ contains only $\phi$ then $simvec(d, a)$ is the
empty sequence, i.e., it is of length zero.

Let $F$ be a distance measure over real vectors assigning a non-negative
real value less than or equal to one, i.e., $F$ is a function which associates
a real value $F(x, y)$, such that $0 \le F(x, y) \le 1$, with every pair of
real vectors $x, y$ of the same length. Given a database sequence $d$ and a
sequence $a$ over $A \cup \{\phi\}$, we define $dist(d, a, F)$ as follows:
if $d$ and $a$ are of different lengths then $dist(d, a, F) = \infty$; if $d$ and $a$
are of the same length and $simvec(d, a)$ is not the empty sequence then
$dist(d, a, F) = F(simvec(d, a), \mathbf{1})$ where $\mathbf{1}$ denotes a vector, of the same
length as $simvec(d, a)$, all of whose components are 1; if $d$ and $a$ are of the
same length and $simvec(d, a)$ is the empty sequence then $dist(d, a, F)$ is
defined to be 0. Thus, if $d$ and $a$ are of the same length then
$dist(d, a, F)$ has a value between 0 and 1; otherwise it has value $\infty$. The value
$\infty$ is given, in the latter case, in order to distinguish it from the case when
$d$ and $a$ have equal lengths and $F(simvec(d, a), \mathbf{1}) = 1$, and also for technical
convenience.

Let $L$ be a language over $A$, $d$ be a database sequence, and $F$ be a vector
distance function. We define two distance measures, $distance_1(d, L, F)$
and $distance_2(d, L, F)$, of $d$ with respect to $L$ using the vector distance
function $F$. If $L$ is non-empty then $distance_1(d, L, F)$ is defined to be
$\min\{dist(d, a, F) : a \in L \text{ and } length(a) = length(d)\}$, otherwise it is defined
to be $\infty$. Note that if $L$ does not contain any strings of the same
length as $d$ then $distance_1(d, L, F)$ is $\infty$.

The definition of $distance_2(d, L, F)$ is more complex and is motivated by
the following situation. Suppose each atomic query in $A$ denotes a logical
predicate and the disjunction of all these predicates is a tautology, i.e., it is
always satisfied. As an example, consider the case when $A = \{P, \neg P\}$ where
$P$ is an atomic proposition. Now consider the language $L_1 = \{aP : a \in A\}$.
The language $L_1$ requires that the atomic proposition $P$ be satisfied at the
second state irrespective of whether $P$ is satisfied or not at the first state. It
can be argued that the distance of a database sequence $d$ with respect to $L_1$
should depend only on the distance of the second database state with respect
to $P$. This intuition leads us to the following definitions.

Recall that $L$ is a language over $A$. Let $closure(L)$ be the smallest
language $L'$ over $A \cup \{\phi\}$ such that $L \subseteq L'$ and the following closure condition
is satisfied for every $\alpha, \beta \in (A \cup \{\phi\})^*$: if $\alpha a \beta \in L'$ for every $a \in A$ then
$\alpha \phi \beta \in L'$. (Recall that $\phi$ is the wild card symbol.) Now we define a partial order
$<$ on the set of sequences over $A \cup \{\phi\}$ as follows. Let $\alpha = (\alpha_0, \ldots, \alpha_{m-1})$
and $\beta = (\beta_0, \ldots, \beta_{n-1})$ be any two sequences over $A \cup \{\phi\}$. Intuitively, $\alpha < \beta$
if $\alpha$ can be obtained from $\beta$ by replacing some of the occurrences of the wild
card symbol in $\beta$ by a symbol in $A$. Formally, $\alpha < \beta$ iff $m = n$ and for
each $i = 0, 1, \ldots, n-1$ either $\alpha_i = \beta_i$ or $\beta_i = \phi$, and there exists at least one
value of $j$ such that $\alpha_j \in A$ and $\beta_j = \phi$.

It is not difficult to see that $<$ is a partial order. Let $S$ be any set of
sequences over $A \cup \{\phi\}$. A sequence $\alpha \in S$ is called maximal if there does
not exist any other sequence $\beta \in S$ such that $\alpha < \beta$. Let $maximal(S)$
denote the set of all maximal sequences in $S$. Now we define $distance_2(d, L, F)$
to be the value of $distance_1(d, maximal(closure(L)), F)$. For the language
$L_1$ (given above), $maximal(closure(L_1))$ consists of the single sequence $\phi P$;
for a database sequence $d$ of length two, it should be easy to see that
$distance_2(d, L_1, F)$ equals the distance of the second database state with
respect to $P$.
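
The definitions of $closure$ and $maximal$ can be checked by brute force on small finite languages. The following is a minimal sketch under assumptions that are not part of the paper: symbols are single characters, the wild card is written `*`, and the function names `closure`, `less` and `maximal` are chosen here for illustration only. It is a direct fixed-point transcription of the definitions, not an efficient algorithm.

```python
WILD = "*"  # stand-in for the wild card symbol (an assumption of this sketch)


def closure(lang, alphabet):
    """Smallest superset of lang closed under: if x a y is present for
    every a in alphabet, then x WILD y is present too."""
    result = set(lang)
    changed = True
    while changed:
        changed = False
        for s in list(result):
            for i, c in enumerate(s):
                if c == WILD:
                    continue
                # Check every replacement of position i by an alphabet symbol.
                if all(s[:i] + a + s[i + 1:] in result for a in alphabet):
                    t = s[:i] + WILD + s[i + 1:]
                    if t not in result:
                        result.add(t)
                        changed = True
    return result


def less(a, b):
    """The order a < b: b agrees with a except where b has a wild card,
    and at least one wild card of b covers a proper symbol of a."""
    if len(a) != len(b):
        return False
    if any(x != y and y != WILD for x, y in zip(a, b)):
        return False
    return any(x != WILD and y == WILD for x, y in zip(a, b))


def maximal(seqs):
    """Sequences of seqs that are not below any other sequence of seqs."""
    return {a for a in seqs if not any(less(a, b) for b in seqs)}


# The language L_1 from the text, writing "P" for P and "N" for its negation.
print(maximal(closure({"PP", "NP"}, "PN")))
```

For the language $L_1$ this prints a set containing the single string `*P`, matching the sequence $\phi P$ stated in the text.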

Note that the distance functions $distance_1$ and $distance_2$
depend on the vector distance function $F$. We consider a spectrum of vector
distance functions $\{F_k : k = 1, 2, \ldots, \infty\}$ defined as follows. Let $x, y$ be two
vectors of length $n$. The value of $F_k(x, y)$ is given as follows. For $k \neq \infty$,

$$F_k(x, y) = \left( \frac{\sum_{0 \le i < n} |x_i - y_i|^k}{n} \right)^{1/k} \qquad (1)$$

$$F_\infty(x, y) = \max\{ |x_i - y_i| : 0 \le i < n \} \qquad (2)$$

Note that $F_1$ is the average block distance function and $F_2$ is the mean
square distance function, etc. It can easily be shown that $F_\infty(x, y) =
\lim_{k \to \infty} F_k(x, y)$. Note that $F_1(x, y)$ gives equal importance to all
components of the vectors; however, as $k$ increases, the numerator in the expression
for $F_k(x, y)$ is dominated by the term having the maximum value, i.e., by
$\max\{ |x_i - y_i| : 0 \le i < n \}$, and in the limit $F_\infty(x, y)$ equals this
maximum value. Thus we see that $F_1$ and $F_\infty$ are the extremes of our distance
functions. We call $F_1$ the average block distance function and $F_\infty$ the
infinite norm distance function.
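
As an illustration of how $simvec$, $F_k$ and $dist$ fit together, here is a minimal sketch. The representation is an assumption made for the example only: a database state is a dictionary from atomic queries to similarity values, the wild card is written `*`, and the names `F`, `simvec`, `dist` and `lookup` are illustrative.

```python
import math

WILD = "*"  # stand-in for the wild card symbol (an assumption of this sketch)


def F(k, x, y):
    """Vector distance F_k between equal-length vectors; k may be math.inf."""
    diffs = [abs(xi - yi) for xi, yi in zip(x, y)]
    if k == math.inf:
        return max(diffs)
    return (sum(t ** k for t in diffs) / len(diffs)) ** (1.0 / k)


def simvec(d, a, simval):
    """Similarity values at the positions of a that are not the wild card."""
    return [simval(u, s) for u, s in zip(d, a) if s != WILD]


def dist(d, a, k, simval):
    """dist(d, a, F_k): infinite for unequal lengths, 0 for an empty simvec."""
    if len(d) != len(a):
        return math.inf
    v = simvec(d, a, simval)
    if not v:
        return 0.0
    return F(k, v, [1.0] * len(v))


# Hypothetical data: each state maps an atomic query to its similarity value.
def lookup(u, s):
    return u[s]


d = [{"a": 1.0, "b": 1.0}, {"a": 0.0, "b": 0.0}]
print(dist(d, "ab", 1, lookup))         # average block distance
print(dist(d, "*b", 1, lookup))         # the wild card position is ignored
print(dist(d, "ab", math.inf, lookup))  # infinite norm
```

With this data the first call averages the deviations 0 and 1, while the second call sees only the deviation at the second state, so a wild card can increase an $F_1$ distance even though it never increases an $F_\infty$ distance.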

The following lemma can easily be proved from our earlier observations. 
LEMMA 2.1: For every database sequence $d$ and language $L$ over $A$ the
following properties hold.

1. $distance_2(d, L, F_\infty) \le distance_1(d, L, F_\infty)$.

2. For any $i, j \in \{1, 2, \ldots, \infty\}$ such that $i < j$, $distance_1(d, L, F_i) \le
distance_1(d, L, F_j)$ and $distance_2(d, L, F_i) \le distance_2(d, L, F_j)$.

Proof: Part 1 of the lemma is proved by the following argument. Let
$length(d) = n$. If $\alpha, \beta$ are two sequences of length $n$ over $A \cup \{\phi\}$ such
that $\alpha < \beta$ or $\alpha = \beta$ then $dist(d, \alpha, F_\infty) \ge dist(d, \beta, F_\infty)$. (To see this,
let $\alpha = (\alpha_0, \alpha_1, \ldots, \alpha_{n-1})$ and $\beta = (\beta_0, \beta_1, \ldots, \beta_{n-1})$; since, for each $i$,
$0 \le i < n$, either $\alpha_i = \beta_i$ or $\beta_i = \phi$, and $simval(d_i, \phi) = 1$, it is the case that
$(1 - simval(d_i, \alpha_i)) \ge (1 - simval(d_i, \beta_i))$. Since $dist(d, \alpha, F_\infty) = \max\{(1 -
simval(d_i, \alpha_i)) : 0 \le i < n\}$ and $dist(d, \beta, F_\infty) = \max\{(1 - simval(d_i, \beta_i)) :
0 \le i < n\}$, it follows that $dist(d, \alpha, F_\infty) \ge dist(d, \beta, F_\infty)$; note
that this relation will not hold, in general, if we replace $F_\infty$ by $F_k$ for any
$k < \infty$.) From the definition of $maximal(closure(L))$, we see that, for every
$\alpha \in L$ there exists a string $\beta \in maximal(closure(L))$ such that $\alpha < \beta$ or
$\alpha = \beta$. From this we see that, for every $\alpha \in L$ of length $n$, there exists a string
$\beta \in maximal(closure(L))$ of length $n$ such that $dist(d, \alpha, F_\infty) \ge dist(d, \beta, F_\infty)$.
As a consequence, $\min\{dist(d, \alpha, F_\infty) : \alpha \in L\} \ge \min\{dist(d, \beta, F_\infty) : \beta \in
maximal(closure(L))\}$. Now from the definitions, we see that the left
and right hand sides of the above inequality are exactly $distance_1(d, L, F_\infty)$
and $distance_2(d, L, F_\infty)$ respectively. Part 1 of the lemma follows from this.
Part 2 of the lemma follows from the well known fact that for any two
$n$-vectors $a, b$, each of whose components lies in the unit interval,
$F_i(a, b) \le F_j(a, b)$ for $i < j$. □

Note that part 1 of the lemma does not hold, in general, if we
replace $F_\infty$ by $F_k$ for any $k < \infty$. The following is a simple counter example
for this. Let $A = \{a, b\}$ and $L = \{ab, bb\}$. Let $d = (d_0, d_1)$. Assume that
$simval(d_0, a)$ and $simval(d_0, b)$ are both equal to 1, and $simval(d_1, b) = 0$.
It should be easy to see that $distance_1(d, L, F_1) = \frac{1}{2}$. It should also be
easy to see that $maximal(closure(L))$ contains the single string $\phi b$. Hence
$distance_2(d, L, F_1) = dist(d, \phi b, F_1)$, which equals 1. Thus, in this case,
part 1 of the lemma does not hold when we use $F_1$. In fact, in this case,
part 1 of the lemma does not hold for any $F_k$ where $k < \infty$. On the
contrary, $distance_1(d, L, F_1) < distance_2(d, L, F_1)$. We can also give an
example for which $distance_2(d, L, F_1) < distance_1(d, L, F_1)$. Thus, in general,
$distance_1(d, L, F_k)$ and $distance_2(d, L, F_k)$ are not related for any $k < \infty$.

3 Automata 

In this section, we consider automata for specifying queries over sequences. 
We give algorithms for computing the two distances of a database sequence 
with respect to a given automaton. 

An automaton $\mathcal{A}$ is a 5-tuple $(Q, A, \delta, I, Final)$ where $Q$ is a finite set of
states, $A$ is a finite set of symbols called the input alphabet, $\delta$ is the set of
transitions, and $I, Final \subseteq Q$ are the sets of initial and final states, respectively.
Each transition of $\mathcal{A}$, i.e., each member of $\delta$, is a triple of the form $(q, a, q')$
where $q, q' \in Q$ and $a \in A$; this triple denotes that the automaton makes a
transition from state $q$ to $q'$ on input $a$; we also represent such a transition as
$q \rightarrow_a q'$. Each input symbol represents an atomic predicate (also called an
atomic query in some places) on a single database state. For example, in a
stock market database, $price(ibm) = 100$ represents an atomic predicate. In
a textual database, each database state represents a document and a database
sequence represents a sequence of documents; here an atomic predicate may
state that the document contains some given key words. In a video database,
which is a sequence of images or shots, each atomic predicate represents a
condition on a picture such as requiring that the picture contain some given
objects.

Let $a = (a_0, a_1, \ldots, a_{n-1})$ be a sequence of input symbols from $A$ and $q, q'$
be states in $Q$. We say that the sequence $a$ takes the automaton $\mathcal{A}$ from
state $q$ to $q'$ if there exists a sequence of states $q_0, q_1, \ldots, q_n$ such that $q_0 = q$
and $q_n = q'$ and for each $i = 0, \ldots, n-1$, $q_i \rightarrow_{a_i} q_{i+1}$ is a transition of $\mathcal{A}$. For
any state $q$, we let $T(q)$ denote the set of sequences that take the automaton
from state $q$ to a final state. We say that the automaton $\mathcal{A}$ accepts the string
$a$ if there exists an initial state $q$ such that $a \in T(q)$. We let $L(\mathcal{A})$ denote the
set of strings accepted by $\mathcal{A}$. We let $|\mathcal{A}|$ denote the number of its states, i.e.,
the cardinality of $Q$, and $Size(\mathcal{A})$ denote the sum of its number of states
and transitions, i.e., the sum of the cardinalities of $Q$ and $\delta$.

We identify vectors with sequences. We let 1 denote a vector all of whose 
components are 1. The length of such a vector will be clear from the context. 

For a database sequence $d$, automaton $\mathcal{A}$, and a vector distance function
$F$, we define two distances $distance_i(d, \mathcal{A}, F)$, for $i = 1, 2$, as follows:
$distance_i(d, \mathcal{A}, F) = distance_i(d, L(\mathcal{A}), F)$.

Algorithms for Computing the distances

Now we outline an algorithm for computing the value of
$distance_1(d, \mathcal{A}, F_\infty)$. Recall that $F_\infty$ is the infinity norm vector
distance measure. Let $d = (d_0, \ldots, d_{n-1})$ be the database
sequence. Essentially, the algorithm computes the distances of the suffixes
of $d$ with respect to the states of the automaton for increasing lengths of
the suffixes. For any automaton state $q$ and integer $i$ ($0 \le i \le n-1$), the
algorithm first computes the value of $distance_1(d[i], T(q), F_\infty)$ in decreasing
values of $i$ starting with $i = n-1$. (Recall that $d[i]$ is the suffix of $d$ starting
with $d_i$ and $T(q)$ is the set of sequences accepted by $\mathcal{A}$ starting in the state
$q$.) The algorithm finally computes $distance_1(d, \mathcal{A}, F_\infty)$ to be the minimum of
the values in $\{distance_1(d[0], T(q), F_\infty) : q \in I\}$; note that $d[0]$ is simply $d$.
The values in the set $\{distance_1(d[i], T(q), F_\infty) : q \in Q\}$ are computed in
decreasing values of $i$ using the recurrence equation given by the following
lemma.

LEMMA 3.1: Let $q$ be any state in $Q$ and $q \rightarrow_{a_1} q_1, \ldots, q \rightarrow_{a_m} q_m$ be all
the transitions in $\delta$ from the state $q$. Then the following properties hold.

1. For $0 \le i < n-1$, $distance_1(d[i], T(q), F_\infty) = \min\{x_1, \ldots, x_m\}$ where
$x_j = \max\{(1 - simval(d_i, a_j)), distance_1(d[i+1], T(q_j), F_\infty)\}$ for
$j = 1, \ldots, m$.

2. If there is at least one $j$ such that $q_j \in Final$ then
$distance_1(d[n-1], T(q), F_\infty) =
\min\{(1 - simval(d_{n-1}, a_j)) : q_j \in Final\}$, otherwise
$distance_1(d[n-1], T(q), F_\infty) = \infty$.

Proof: Part 1 of the lemma is seen as follows. Assume $i < n-1$ and
observe that $T(q) = \bigcup_{1 \le j \le m} (a_j T(q_j))$. Hence $distance_1(d[i], T(q), F_\infty) =
\min\{distance_1(d[i], a_j T(q_j), F_\infty) : 1 \le j \le m\}$. It should not be difficult to
see that for each $j$, $1 \le j \le m$, $distance_1(d[i], a_j T(q_j), F_\infty)$ is $x_j$. Part 2 of
the lemma follows from the fact that the set of strings of unit length in $T(q)$
is the set $\{a_j : q_j \in Final\}$. □

It is easy to see that the complexity of this algorithm is $O(n \cdot Size(\mathcal{A}))$.
Thus the algorithm is of linear complexity in $n$ and in $Size(\mathcal{A})$. A formal
description of the algorithm is given in [HSOO].
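
The recurrence of Lemma 3.1 can be sketched as a backward dynamic program. This is not the formal description cited above; it is a minimal illustration under assumptions made here: the automaton is given as a dictionary `transitions` from a state to a list of `(symbol, next_state)` pairs, `simval` is supplied by the caller, the sequence has length at least one, and the name `distance1_inf` is hypothetical.

```python
import math


def distance1_inf(d, transitions, initial, final, simval):
    """distance_1(d, A, F_inf) via the backward recurrence of Lemma 3.1.

    transitions maps a state q to a list of (symbol, next_state) pairs.
    Assumes len(d) >= 1.
    """
    n = len(d)
    states = set(transitions) | set(initial) | set(final)
    for outs in transitions.values():
        states.update(q2 for _, q2 in outs)

    # Base case (part 2 of the lemma): the suffix d[n-1] has length one.
    cur = {
        q: min(
            (1 - simval(d[n - 1], a)
             for a, q2 in transitions.get(q, []) if q2 in final),
            default=math.inf,
        )
        for q in states
    }
    # Recurrence (part 1 of the lemma) for i = n-2 down to 0.
    for i in range(n - 2, -1, -1):
        nxt = cur
        cur = {
            q: min(
                (max(1 - simval(d[i], a), nxt[q2])
                 for a, q2 in transitions.get(q, [])),
                default=math.inf,
            )
            for q in states
        }
    return min((cur[q] for q in initial), default=math.inf)


# Hypothetical automaton for the language a*b and a three-state sequence.
transitions = {0: [("a", 0), ("b", 1)]}
d = [{"a": 0.75, "b": 0.0}, {"a": 0.5, "b": 0.0}, {"a": 0.0, "b": 0.75}]
print(distance1_inf(d, transitions, {0}, {1}, lambda u, s: u[s]))
```

Each of the $n$ passes touches every transition once, which is consistent with the $O(n \cdot Size(\mathcal{A}))$ bound stated in the text. In the example the only accepted string of length three is $aab$, so the result is the largest of the three deviations.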
Now we show how to compute $distance_1(d, \mathcal{A}, F_k)$ for any $k$ such that
$0 < k < \infty$. For a sequence $a$ over $A \cup \{\phi\}$, let $elength(a)$ denote the
number of values of $i$ such that $a_i \in A$, i.e., $a_i$ is not the wild card symbol. For
any database sequence $d$ as given above and any sequence $a = (a_0, \ldots, a_{n-1})$
over the alphabet $A \cup \{\phi\}$ and for any $k > 0$, define an un-normalized
distance $udist_k(d, a)$ as follows: $udist_k(d, a) = \sum_{0 \le i < n} (1 - simval(d_i, a_i))^k$. It
should be easy to see that, if $elength(a) > 0$ then $distance_1(d, a, F_k) =
\left( \frac{udist_k(d, a)}{elength(a)} \right)^{1/k}$; if $elength(a) = 0$ then $distance_1(d, a, F_k) = 0$. For a
database sequence $d$ and a language $L$ over $A \cup \{\phi\}$, we define a set
$Udist\_elength_k(d, L)$ of un-normalized distance and effective length pairs as
follows: $Udist\_elength_k(d, L) = \{(x, l) : \exists a \in L$ such that $elength(a) = l$ and
$x$ is the minimum value of $udist_k(d, b)$ over all $b \in L$ whose effective length is
$l$, i.e., $elength(b) = l\}$.

The following lemma shows how $distance_1(d, \mathcal{A}, F_k)$ can be computed
from the sets $Udist\_elength_k(d[0], T(q))$ for each initial state $q$, i.e., $q \in I$.

LEMMA 3.2: The following properties hold.

1. If the sets $Udist\_elength_k(d[0], T(q))$ for each initial state $q$ are all
empty then $distance_1(d, \mathcal{A}, F_k) = 1$.

2. If $(0, 0) \in Udist\_elength_k(d[0], T(q))$ for some $q \in I$ then
$distance_1(d, \mathcal{A}, F_k) = 0$.

3. If none of the above conditions holds then $distance_1(d, \mathcal{A}, F_k) =
\min\{(x/l)^{1/k} : (x, l) \in Udist\_elength_k(d[0], T(q))$ for some $q \in I\}$.

Proof: The condition of part 1 indicates that there are no strings of the
same length as $d$ that are accepted by $\mathcal{A}$ and hence $distance_1(d, \mathcal{A}, F_k) = 1$.
The condition of part 2 indicates that there is a string containing only the
symbol $\phi$ that is of the same length as $d$ that is accepted by $\mathcal{A}$ and hence
$distance_1(d, \mathcal{A}, F_k) = 0$. Part 3 of the lemma follows from the definitions. □

The following lemma leads to a method for computing the sets
$Udist\_elength_k(d[i], T(q))$ for each state $q$ in decreasing values of $i$.

LEMMA 3.3: Let $q$ be any state in $Q$ and $q \rightarrow_{a_1} q_1, \ldots, q \rightarrow_{a_m} q_m$ be all
the transitions in $\delta$ from the state $q$. Then the following properties hold.

1. Let $i$ be an integer such that $0 \le i < n-1$ and $U =
\{(x + (1 - simval(d_i, a_j))^k, l + 1) : 1 \le j \le m$ and
$a_j \neq \phi$ and $(x, l) \in Udist\_elength_k(d[i+1], T(q_j))\} \cup \{(x, l) \in
Udist\_elength_k(d[i+1], T(q_j)) : 1 \le j \le m, a_j = \phi\}$. Then,
$Udist\_elength_k(d[i], T(q)) = \{(x, l) \in U : x$ is the minimum of all
pairs of the form $(y, l) \in U\}$.

2. If none of the $q_j$ is in $Final$ then $Udist\_elength_k(d[n-1], T(q))$
contains the single element $(\infty, 1)$. Otherwise, if $\exists j$ such that $1 \le j \le m$
and $a_j = \phi$ and $q_j \in Final$ then $Udist\_elength_k(d[n-1], T(q))$
contains the pair $(0, 0)$; if $\exists j$ such that $1 \le j \le m$ and $a_j \neq \phi$ and
$q_j \in Final$ then $Udist\_elength_k(d[n-1], T(q))$ contains the pair $(x, 1)$
where $x = \min\{(1 - simval(d_{n-1}, a_j))^k : a_j \neq \phi$ and $q_j \in Final\}$.

Proof: Part 1 of the lemma follows from the definition of
$Udist\_elength_k(d[i], T(q))$ and the fact that $T(q) = \bigcup_{1 \le j \le m} a_j T(q_j)$. Part
2 of the lemma follows from the following observations. If none of the $q_j$ is in $Final$
then $T(q)$ has no strings of length 1 and hence $Udist\_elength_k(d[n-1], T(q))$
contains the single element $(\infty, 1)$. If $\exists j$ such that $1 \le j \le m$ and
$a_j = \phi$ and $q_j \in Final$ then the string $\phi$ of length 1 is in $T(q)$ and
hence $(0, 0) \in Udist\_elength_k(d[n-1], T(q))$. If $\exists j$ such that $1 \le j \le m$
and $a_j \neq \phi$ and $q_j \in Final$ then from the definitions it is seen that
$(x, 1) \in Udist\_elength_k(d[n-1], T(q))$ where $x$ is as given in the lemma.
Note that $Udist\_elength_k(d[n-1], T(q))$ contains at most two
elements. □

Note that for any element $(x, l)$ in $Udist\_elength_k(d[i], T(q))$,
$0 \le l \le n - i$. Furthermore, for any two distinct elements $(x, l)$ and $(x', l')$
in $Udist\_elength_k(d[i], T(q))$ it is the case that $l \neq l'$. As a consequence,
$Udist\_elength_k(d[i], T(q))$ has at most $n - i + 1$ elements. Using
part 2 of the above lemma, we see that the values of the set
$\{Udist\_elength_k(d[n-1], T(q)) : q \in Q\}$ can all be computed in time
$O(Size(\mathcal{A}))$. Using part 1 of the lemma, we see that the values in
the set $\{Udist\_elength_k(d[i], T(q)) : q \in Q\}$ can be computed from
the values in the set $\{Udist\_elength_k(d[i+1], T(q)) : q \in Q\}$ in time
$O((n - i) \cdot Size(\mathcal{A}))$. Using the last step repeatedly, we see that the
values in the set $\{Udist\_elength_k(d[0], T(q)) : q \in Q\}$ can be computed in
time $O(n^2 \cdot Size(\mathcal{A}))$. Hence $distance_1(d, \mathcal{A}, F_k)$ can be computed in time
$O(n^2 \cdot Size(\mathcal{A}))$ where $n = length(d)$. Thus the algorithm is of complexity
quadratic in the length of $d$ and linear in $Size(\mathcal{A})$.
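
The computation described by Lemmas 3.2 and 3.3 can be sketched as follows. The sketch makes several assumptions that are not in the text: each set $Udist\_elength_k$ is stored as a dictionary from effective length $l$ to the minimum un-normalized distance $x$, an empty dictionary stands for "no string of this length" (instead of the pair $(\infty, 1)$), and in that case the function returns `math.inf`, following the Section 2 definition of $distance_1$. The name `distance1_k` and the `*` wild card are illustrative.

```python
import math

WILD = "*"  # stand-in for the wild card symbol (an assumption of this sketch)


def distance1_k(d, transitions, initial, final, simval, k):
    """distance_1(d, A, F_k) for finite k, following Lemmas 3.2 and 3.3.

    table[q] maps an effective length l to the least un-normalized
    distance of the current suffix over strings of T(q) with that l.
    Assumes len(d) >= 1.
    """
    n = len(d)
    states = set(transitions) | set(initial) | set(final)
    for outs in transitions.values():
        states.update(q2 for _, q2 in outs)

    def cost(u, a):
        return (1 - simval(u, a)) ** k

    # Base case: suffix d[n-1], strings of length one.
    table = {}
    for q in states:
        entry = {}
        for a, q2 in transitions.get(q, []):
            if q2 in final:
                if a == WILD:
                    entry[0] = 0.0
                else:
                    entry[1] = min(entry.get(1, math.inf), cost(d[n - 1], a))
        table[q] = entry

    # Backward step: extend every pair (x, l) of a successor state.
    for i in range(n - 2, -1, -1):
        new = {}
        for q in states:
            entry = {}
            for a, q2 in transitions.get(q, []):
                for l, x in table[q2].items():
                    if a == WILD:
                        key, val = l, x
                    else:
                        key, val = l + 1, x + cost(d[i], a)
                    if val < entry.get(key, math.inf):
                        entry[key] = val
            new[q] = entry
        table = new

    # Normalize each pair and take the minimum over the initial states.
    best = math.inf
    for q in initial:
        for l, x in table[q].items():
            best = min(best, 0.0 if l == 0 else (x / l) ** (1.0 / k))
    return best


# Hypothetical automaton for a*b, then one that also allows wild cards.
d = [{"a": 0.75, "b": 0.0}, {"a": 0.5, "b": 0.0}, {"a": 0.0, "b": 0.75}]
sv = lambda u, s: u[s]
plain = {0: [("a", 0), ("b", 1)]}
wild = {0: [("a", 0), (WILD, 0), ("b", 1)]}
print(distance1_k(d, plain, {0}, {1}, sv, 1))
print(distance1_k(d, wild, {0}, {1}, sv, 1))
```

Keeping one entry per effective length is what makes each backward step cost up to $(n - i)$ table entries per transition, in line with the quadratic bound derived in the text.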
If the input symbol $\phi$ does not appear in any string in the language $L(\mathcal{A})$
then we can delete all transitions on the symbol $\phi$ from the transition set $\delta$
of the automaton $\mathcal{A}$. The resulting automaton has no transitions on $\phi$.
In this case, it is not difficult to see that, for each $i = 0, \ldots, n-1$, the set
$Udist\_elength_k(d[i], T(q))$ has at most one element. Hence, the set of values
$\{Udist\_elength_k(d[i], T(q)) : q \in Q\}$ can be computed from the values in
the set $\{Udist\_elength_k(d[i+1], T(q)) : q \in Q\}$ in time $O(Size(\mathcal{A}))$ only.
As a consequence the set of values $\{Udist\_elength_k(d[0], T(q)) : q \in Q\}$ can
be computed in time $O(n \cdot Size(\mathcal{A}))$. Hence $distance_1(d, \mathcal{A}, F_k)$ can be
computed in time $O(n \cdot Size(\mathcal{A}))$. Thus, the resulting algorithm is only of
linear complexity in the length of $d$ as well as in $Size(\mathcal{A})$.

Computing the value of $distance_2(d, \mathcal{A}, F)$

The proof of the following lemma gives a method for computing
$distance_2(d, \mathcal{A}, F)$ where $F$ is any of the vector distance functions given
previously. The complexity of the algorithm is triple exponential in the number
of states of $\mathcal{A}$, and linear or quadratic in the length of $d$.

LEMMA 3.4: For a database sequence $d$ of length $n$, automaton $\mathcal{A}$ with
$m$ states and vector distance function $F$, there exists an algorithm
that computes $distance_2(d, \mathcal{A}, F)$ whose complexity is triple exponential in
$m$ and proportional to $p(n)$, where $p(n)$ is $n$ if $F = F_\infty$, and is $n^2$ if
$F = F_i$ for $i < \infty$.

Proof: We prove the lemma by giving an algorithm, of the appropriate
complexity, that computes $distance_2(d, \mathcal{A}, F)$. From the definition, we
have $distance_2(d, \mathcal{A}, F) = distance_1(d, maximal(closure(L(\mathcal{A}))), F)$. Let
$A' = A \cup \{\phi\}$ where $\phi$ is the wild card symbol. Recall that the elements of
$closure(L(\mathcal{A}))$ are strings over the alphabet $A'$. We construct an automaton
$\mathcal{H}$ that accepts the language $maximal(closure(L(\mathcal{A})))$. Then we simply
compute $distance_2(d, \mathcal{A}, F)$ to be the value of $distance_1(d, \mathcal{H}, F)$.

Now we show how to compute the automaton $\mathcal{H}$. First we compute the
automaton $\bar{\mathcal{A}}$ which accepts the complement of the language accepted by
the automaton $\mathcal{A}$. From this automaton, we construct another automaton $\mathcal{C}$
which accepts the complement of the language $closure(L(\mathcal{A}))$, i.e., the
language $\overline{closure(L(\mathcal{A}))}$. The construction of $\mathcal{C}$ uses the following fact. If a
string $\alpha \in (A')^*$ is in $\overline{closure(L(\mathcal{A}))}$ then there exists another string $\beta \in A^*$
obtained from $\alpha$ by replacing each occurrence of $\phi$ in it with some symbol
in $A$ such that $\beta \in L(\bar{\mathcal{A}})$. $\mathcal{C}$ simulates $\bar{\mathcal{A}}$ on an input string with the
following modification. Whenever it sees the input symbol $\phi$, it replaces
it, non-deterministically, by some symbol from $A$ and simulates $\bar{\mathcal{A}}$ on the
guessed symbol. It accepts if $\bar{\mathcal{A}}$ accepts. It is not difficult to see that $\mathcal{C}$
accepts a string $\alpha$ over the alphabet $A'$ iff there exists a string $\beta$, obtained
by replacing every occurrence of the symbol $\phi$ in $\alpha$ by some symbol from
$A$, which is accepted by $\bar{\mathcal{A}}$. Hence, it is easy to see that $\mathcal{C}$ accepts the
language $\overline{closure(L(\mathcal{A}))}$. It is not difficult to see that we can obtain such an
automaton $\mathcal{C}$ such that $|\mathcal{C}| = |\bar{\mathcal{A}}|$.

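Next we construct the automaton $\bar{\mathcal{C}}$ which accepts the complement of the
language accepted by $\mathcal{C}$, i.e., which accepts the language $closure(L(\mathcal{A}))$. It is
to be noted that $|\bar{\mathcal{C}}| \le 2^{|\mathcal{C}|}$ and hence $|\bar{\mathcal{C}}| \le 2^{|\bar{\mathcal{A}}|}$. Similarly, $|\bar{\mathcal{A}}| \le 2^{|\mathcal{A}|}$. From
this we see that $|\bar{\mathcal{C}}| \le 2^{2^{|\mathcal{A}|}}$ and hence $|\bar{\mathcal{C}}| \le 2^{2^m}$.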

Using $\bar{\mathcal{C}}$, we construct an automaton $\mathcal{D}$ which accepts the complement
of the language $maximal(closure(L(\mathcal{A})))$. The automaton $\mathcal{D}$ on an input
string $\alpha$ over $A'$ acts as follows. It accepts $\alpha$ if either $\alpha \notin closure(L(\mathcal{A}))$, or
there exists another string $\beta \in closure(L(\mathcal{A}))$ such that $\alpha < \beta$. The former
condition is checked by simulating $\mathcal{C}$ over $\alpha$. To check the latter condition, $\mathcal{D}$
non-deterministically changes at least one of the input symbols in $\alpha$, which
is an element of $A$, to $\phi$ and checks that the resulting string is accepted
by $\bar{\mathcal{C}}$, i.e., is in $closure(L(\mathcal{A}))$. It is easy to see that we can construct such
an automaton $\mathcal{D}$ such that its number of states is linear in the number of
states of $\bar{\mathcal{C}}$, and hence is double exponential in $m$. Next we construct the
complement $\bar{\mathcal{D}}$ of $\mathcal{D}$. Clearly $\bar{\mathcal{D}}$ accepts the language $maximal(closure(L(\mathcal{A})))$.
We take $\mathcal{H}$ to be the automaton $\bar{\mathcal{D}}$. Clearly, $|\mathcal{H}|$ is at most exponential in
the number of states of $\mathcal{D}$, and hence triple exponential in $m$. Note that $Size(\mathcal{H})$,
which is the sum of its number of states and transitions, is quadratic in $|\mathcal{H}|$.
Hence $Size(\mathcal{H})$ is also triple exponential in $m$.

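For each $i = 1, \ldots, \infty$, we compute $distance_2(d, \mathcal{A}, F_i)$ to be
$distance_1(d, \mathcal{H}, F_i)$. For $i = \infty$, the complexity of the algorithm is
$O(n \cdot Size(\mathcal{H}))$, which is linear in $n$ and triple exponential in $m$. For
$i < \infty$, the complexity is $O(n^2 \cdot Size(\mathcal{H}))$, which is quadratic in $n$ and
triple exponential in $m$. □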

4 Temporal Logic 

In this section, we consider linear temporal logics as one of the formalisms for
specifying queries over database sequences. Such logics have been extensively
used in specification of properties of concurrent programs [MP92]. They
have also been used in database systems for specifying queries in temporal
databases [Cho92a, Cho92b] and for specifying triggers in active database
systems [SW95a, SW95b]. We assume that we have a finite set $\mathcal{P}$ whose
members are called atomic propositions. Each member of this set denotes
an atomic predicate over a database state. Formulas of Temporal Logic
(TL) are formed from atomic propositions using the propositional connectives
$\wedge, \vee, \neg$ and the temporal operators Nexttime ("nexttime") and Until
("until"). The set of formulas of TL is the smallest set satisfying the following
conditions. Every atomic proposition is a formula of TL; both true and false
are formulas; if $g$ and $h$ are formulas of TL then $g \wedge h$, $g \vee h$, $\neg g$, Nexttime $g$
and $g$ Until $h$ are also formulas of TL. For a formula $f$, we let $length(f)$
denote its length.

Given a database sequence d = (d_0, d_1, ..., d_{n-1}), a temporal formula f,
and a vector distance function F, we define a distance function
syndist(d, f, F) inductively based on the syntax of f as follows.

• For an atomic proposition P, syndist(d, P, F) = 1 − simval(d_0, P).

• syndist(d, g ∧ h, F) = max{syndist(d, g, F), syndist(d, h, F)}.

• syndist(d, g ∨ h, F) = min{syndist(d, g, F), syndist(d, h, F)}.

• syndist(d, ¬g, F) = 1 − syndist(d, g, F).

• syndist(d, Nexttime g, F) = syndist(d[1], g, F) if length(d) > 1;
otherwise, syndist(d, Nexttime g, F) = ∞.

• syndist(d, g Until h, F) = min{F(U_i, 1) : 0 ≤ i < n}, where U_i is the
vector (u_{i,0}, u_{i,1}, ..., u_{i,i}) whose components are given as follows:
u_{i,i} = 1 − syndist(d[i], h, F) and, for j with 0 ≤ j < i,
u_{i,j} = 1 − syndist(d[j], g, F).
Intuitively, this definition corresponds to the exact semantics of Until.
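The inductive definition above can be transcribed almost literally into code. The following is a minimal sketch, not an efficient algorithm. It rests on several assumptions that are not stated in this section: formulas are encoded as nested tuples, a database state is a dictionary from atomic propositions to similarity values, d[i] is taken to be the suffix of d starting at d_i, the second argument of F(U_i, 1) is read as the all-ones vector, and F_k is taken to be the length-normalised k-norm (with the maximum for k = ∞). All function names are hypothetical.

```python
import math

def vector_distance(u, k):
    """Assumed form of F_k(u, 1): normalised k-norm of the gaps to the all-ones vector."""
    gaps = [abs(1.0 - x) for x in u]
    if k == math.inf:
        return max(gaps)
    return (sum(g ** k for g in gaps) / len(gaps)) ** (1.0 / k)

def syndist(d, f, k=math.inf):
    """Syntactic distance of the non-empty sequence d with respect to formula f.

    d is a list of dicts (atomic proposition -> similarity value in [0, 1]);
    f is a nested tuple such as ("until", ("atom", "P"), ("atom", "Q")).
    """
    op = f[0]
    if op == "atom":
        return 1.0 - d[0][f[1]]
    if op == "and":
        return max(syndist(d, f[1], k), syndist(d, f[2], k))
    if op == "or":
        return min(syndist(d, f[1], k), syndist(d, f[2], k))
    if op == "not":
        return 1.0 - syndist(d, f[1], k)
    if op == "next":
        # No next state: the definition assigns infinite distance.
        return syndist(d[1:], f[1], k) if len(d) > 1 else math.inf
    if op == "until":
        best = math.inf
        for i in range(len(d)):
            # u_{i,j} for j < i comes from g, u_{i,i} comes from h.
            u = [1.0 - syndist(d[j:], f[1], k) for j in range(i)]
            u.append(1.0 - syndist(d[i:], f[2], k))
            best = min(best, vector_distance(u, k))
        return best
    raise ValueError("unknown operator: %r" % (op,))
```

This direct recursion recomputes the distances of subformulas many times, so it is only suitable for checking small cases by hand.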

Now, we define two types of semantic distance functions between a
database sequence d and a TL formula f. To do this, we need the following
definitions. Let Δ be the set of all subsets of atomic propositions, i.e.,
Δ = 2^𝒫. Let s = (s_0, ..., s_{n-1}) be any sequence over Δ. Now we define the
satisfaction of f at the beginning of s inductively on the structure of f as
follows.

• For an atomic proposition P ∈ 𝒫, s satisfies P if P ∈ s_0.

• s satisfies g ∧ h if s satisfies both g and h. s satisfies g ∨ h if s
satisfies either g or h.

• s satisfies ¬g if s does not satisfy g.

• s satisfies Nexttime g if n > 1 and s[1] satisfies g.

• s satisfies g Until h if there exists an i < n such that s[i] satisfies h
and for all j, 0 ≤ j < i, s[j] satisfies g.
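The satisfaction relation above can be checked mechanically on small sequences. The sketch below assumes a tuple encoding of formulas and represents each element of a sequence as a Python set of atomic proposition names; s[i] is read as the suffix of s starting at position i. The function name and the encoding are illustrative assumptions, not part of the paper.

```python
def satisfies(s, f):
    """Exact (two-valued) satisfaction of formula f at the beginning of s.

    s is a non-empty list of sets of atomic propositions;
    f is a nested tuple such as ("until", ("atom", "P"), ("atom", "Q")).
    """
    op = f[0]
    if op == "atom":
        return f[1] in s[0]
    if op == "and":
        return satisfies(s, f[1]) and satisfies(s, f[2])
    if op == "or":
        return satisfies(s, f[1]) or satisfies(s, f[2])
    if op == "not":
        return not satisfies(s, f[1])
    if op == "next":
        return len(s) > 1 and satisfies(s[1:], f[1])
    if op == "until":
        # Some position i satisfies h, and every earlier position satisfies g.
        return any(
            satisfies(s[i:], f[2])
            and all(satisfies(s[j:], f[1]) for j in range(i))
            for i in range(len(s))
        )
    raise ValueError("unknown operator: %r" % (op,))
```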

We say that two TL formulas f and g are equivalent if the sets of sequences
(over Δ) that satisfy them are identical. Let f be a TL formula and 𝒫 =
{P_0, P_1, ..., P_{m-1}} be the set of atomic propositions that appear in f.
Let s = (s_0, ..., s_{n-1}) be any sequence over Δ. Let Ψ be the set consisting
of the elements of 𝒫 and the negations of elements in 𝒫. Formally,
Ψ = 𝒫 ∪ {¬P_i : 0 ≤ i < m}. Now we define a sequence expn(s) over Ψ which is
obtained from s by expanding each s_i into a subsequence of length m whose
j-th element is P_j or ¬P_j depending on whether P_j is in s_i or not.
Formally, expn(s) = (t_0, ..., t_{nm-1}) is a sequence of length nm defined as
follows: for each i, j such that 0 ≤ i < n and 0 ≤ j < m, if P_j ∈ s_i then
t_{im+j} = P_j, otherwise t_{im+j} = ¬P_j.

For a TL formula f, let L(f) = {expn(s) : s satisfies f}. Note that
L(f) is a language over Ψ. For any positive integer r, let L_r(f) = {expn(s) :
length(s) = r and s satisfies f}. L_r(f) corresponds to those sequences of
length r that satisfy f.

Let d = (d_0, d_1, ..., d_{n-1}) be a database sequence. Now, we define another
database sequence expn(d) obtained from d by repeating each state m times in
succession (recall that m is the number of atomic propositions), i.e.,
expn(d) = ((d_0)^m, (d_1)^m, ..., (d_{n-1})^m). Let f be a TL formula and
F be a vector distance function. Now, for each j = 1, 2, we define a semantic
distance of d with respect to f and F (denoted by semdist_j(d, f, F))
as follows: semdist_j(d, f, F) = distance_j(expn(d), L_n(f), F). Recall that
distance_j is defined in the previous section.
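The two expansions can be illustrated with a short sketch. It assumes that elements of Δ are Python sets of proposition names, that a negated proposition ¬P is written as the string "~P", and that database states are arbitrary Python objects; these encodings and the function names are hypothetical and chosen only for illustration.

```python
def expn_sequence(s, props):
    """Expand a sequence over sets of propositions into a sequence over literals.

    Position i*m + j of the result is props[j] if props[j] is in s[i],
    and the negated literal "~" + props[j] otherwise.
    """
    out = []
    for state in s:
        for p in props:
            out.append(p if p in state else "~" + p)
    return out

def expn_database(d, m):
    """Repeat every database state m times in succession."""
    out = []
    for state in d:
        out.extend([state] * m)
    return out
```

With these encodings, expn_sequence(s, props) and expn_database(d, len(props)) have the same length nm, so the t-th database state can be compared with the t-th literal.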

It is to be noted that we have assumed the set 𝒫 to be exactly the set of
atomic propositions that appear in f. However, if we take 𝒫 to be any
superset of the set of atomic propositions appearing in f, then it is easy
to see that the syntactic distances, i.e., syndist(d, f, F_k) for k such that
1 ≤ k ≤ ∞, remain the same. It can be shown that similarly
semdist_2(d, f, F_k) for all k = 1, ..., ∞ and semdist_1(d, f, F_∞) remain the
same. That is, all these distance measures depend only on the similarity
values of atomic propositions that appear in f and not on other atomic
propositions. On the other hand, this property does not hold for the distance
measures semdist_1(d, f, F_k) for k ≠ ∞.

The semantic distances of a database sequence with respect to equivalent
TL formulas are equal (i.e., if f and g are equivalent then
semdist_j(d, f, F) = semdist_j(d, g, F)). However, this property does not hold
for syntactic distances. For example, the syntactic distance of a database
sequence with respect to the two equivalent formulas (P ∧ Q) ∨ (P ∧ ¬Q) and
P may be different. The following lemma shows that
syndist(d, f, F_∞) ≤ semdist_1(d, f, F_∞) for formulas in which negations are
applied only to atomic propositions.
The lemma can be proven by induction on the structure of the formula f.
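To see how the two equivalent formulas can receive different syntactic distances, consider a hypothetical first state d_0 with simval(d_0, P) = 1 and simval(d_0, Q) = 0.5 (these values are chosen only for illustration). Applying the inductive definition of syndist for the propositional connectives gives:

```latex
\begin{align*}
\mathit{syndist}(d, P, F) &= 1 - 1 = 0,\\
\mathit{syndist}(d, Q, F) &= 1 - 0.5 = 0.5,\\
\mathit{syndist}(d, \neg Q, F) &= 1 - \mathit{syndist}(d, Q, F) = 0.5,\\
\mathit{syndist}(d, P \wedge Q, F) &= \max\{0,\ 0.5\} = 0.5,\\
\mathit{syndist}(d, P \wedge \neg Q, F) &= \max\{0,\ 0.5\} = 0.5,\\
\mathit{syndist}(d, (P \wedge Q) \vee (P \wedge \neg Q), F) &= \min\{0.5,\ 0.5\} = 0.5 \;\neq\; 0 .
\end{align*}
```

No temporal operator is involved, so the computation does not depend on the choice of F.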

LEMMA 4.1: For any database sequence d and TL formula f in
which all negations are applied to atomic propositions,
syndist(d, f, F_∞) ≤ semdist_1(d, f, F_∞).

Proof: The proof has two steps. In the first step, we show that any
formula g, in which all negations are applied only to atomic propositions,
can be transformed to a formula G(g) that has no Until operator appearing
in it and in which all negations are applied only to atomic propositions,
such that syndist(d, g, F_∞) = syndist(d, G(g), F_∞) and
semdist_1(d, g, F_∞) = semdist_1(d, G(g), F_∞). Let n = length(d). The formula
G(g) is defined inductively on the structure of g as follows. If g is an
atomic proposition or the negation of an atomic proposition then G(g) = g.
If g = g_1 ∧ g_2 or g = g_1 ∨ g_2 or g = Nexttime g_1 then
G(g) = G(g_1) ∧ G(g_2) or G(g) = G(g_1) ∨ G(g_2) or G(g) = Nexttime G(g_1),
respectively. If g = g_1 Until g_2 then

G(g) = ⋁_{0≤i<n} ( (⋀_{0≤j<i} (Nexttime)^j G(g_1)) ∧ (Nexttime)^i G(g_2) ).

In the above definition, (Nexttime)^j denotes a string of j Nexttime
operators. It is to be noted that in the definition of G(g) for
g = g_1 Until g_2, we are replacing the Until operator by n disjuncts (recall
that n = length(d)); the i-th disjunct asserts that g_2 is satisfied after i
database states and that g_1 is satisfied at all the intermediate states;
also note that the i-th disjunct has i conjuncts. By a simple induction on
the structure of g, one can easily show that
syndist(d, g, F_∞) = syndist(d, G(g), F_∞). It is to be noted that this
property is not satisfied if we replace F_∞ by any other F_k. It can also
be shown that for any sequence s of length n over Δ, s satisfies g iff s
satisfies G(g); i.e., the set of sequences of length n that satisfy g is the
same as the set of sequences of length n that satisfy G(g). As a consequence,
semdist_1(d, g, F_∞) = semdist_1(d, G(g), F_∞).
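The transformation G used in the first step of the proof can be sketched as a small rewriting function. The tuple encoding of formulas and the helper names below are assumptions made for illustration. Big conjunctions and disjunctions are folded into nested binary operators, and the subformulas under Until are themselves unrolled so that the result contains no Until operator.

```python
def next_power(f, j):
    """Prefix formula f with a string of j Nexttime operators."""
    for _ in range(j):
        f = ("next", f)
    return f

def fold(op, parts):
    """Combine a non-empty list of formulas with the binary operator op."""
    result = parts[0]
    for p in parts[1:]:
        result = (op, result, p)
    return result

def unroll(g, n):
    """Replace every Until in g by n disjuncts, for sequences of length n.

    Assumes negations are applied only to atomic propositions.
    """
    op = g[0]
    if op in ("atom", "not"):
        return g
    if op in ("and", "or"):
        return (op, unroll(g[1], n), unroll(g[2], n))
    if op == "next":
        return ("next", unroll(g[1], n))
    if op == "until":
        g1, g2 = unroll(g[1], n), unroll(g[2], n)
        disjuncts = []
        for i in range(n):
            # g1 must hold at positions 0..i-1 and g2 at position i.
            conjuncts = [next_power(g1, j) for j in range(i)]
            conjuncts.append(next_power(g2, i))
            disjuncts.append(fold("and", conjuncts))
        return fold("or", disjuncts)
    raise ValueError("unknown operator: %r" % (op,))
```

For example, with n = 2 the formula P Until Q becomes Q ∨ (P ∧ Nexttime Q).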

Now we show, by induction, that for any formula g that does not contain
any Until operator and in which all negations are applied only to atomic
propositions, syndist(d, g, F_∞) ≤ semdist_1(d, g, F_∞). For the base case,
i.e., when g is an atomic proposition or the negation of an atomic
proposition, the property trivially holds. Now consider the cases when
g = g_1 ∨ g_2 or g = g_1 ∧ g_2. As induction hypothesis, assume
syndist(d, g_1, F_∞) ≤ semdist_1(d, g_1, F_∞) and
syndist(d, g_2, F_∞) ≤ semdist_1(d, g_2, F_∞). Observe that

min{syndist(d, g_1, F_∞), syndist(d, g_2, F_∞)}
≤ min{semdist_1(d, g_1, F_∞), semdist_1(d, g_2, F_∞)}.

From the definitions, we see that the left hand side of the above inequality
is syndist(d, g_1 ∨ g_2, F_∞) and the right hand side equals
semdist_1(d, g_1 ∨ g_2, F_∞). Hence, it is the case that
syndist(d, g_1 ∨ g_2, F_∞) ≤ semdist_1(d, g_1 ∨ g_2, F_∞).
It is also easy to see that

max{syndist(d, g_1, F_∞), syndist(d, g_2, F_∞)}
≤ max{semdist_1(d, g_1, F_∞), semdist_1(d, g_2, F_∞)}.

The left hand side of this inequality is syndist(d, g_1 ∧ g_2, F_∞). We show
that its right hand side is less than or equal to
semdist_1(d, g_1 ∧ g_2, F_∞). Let X_1 = L_n(g_1) and X_2 = L_n(g_2). From the
definitions, we have
semdist_1(d, g_1 ∧ g_2, F_∞) = min{dist(expn(d), s, F_∞) : s ∈ X_1 ∩ X_2},
semdist_1(d, g_1, F_∞) = min{dist(expn(d), s, F_∞) : s ∈ X_1}, and
semdist_1(d, g_2, F_∞) = min{dist(expn(d), s, F_∞) : s ∈ X_2}.
From these we see that both semdist_1(d, g_1, F_∞) and
semdist_1(d, g_2, F_∞) are less than or equal to
semdist_1(d, g_1 ∧ g_2, F_∞), since X_1 ∩ X_2 is a subset of both X_1 and X_2.
Hence max{semdist_1(d, g_1, F_∞), semdist_1(d, g_2, F_∞)} is less than or
equal to semdist_1(d, g_1 ∧ g_2, F_∞).

Now consider the case when g = Nexttime g_1. If length(d) ≤ 1
then syndist(d, g, F_∞) = ∞, and in this case semdist_1(d, g, F_∞) is also
∞ because all strings that satisfy g are of length at least two. Now
consider the case when length(d) ≥ 2 and as induction hypothesis assume
that syndist(d[1], g_1, F_∞) ≤ semdist_1(d[1], g_1, F_∞). From the
definitions, we see that syndist(d, g, F_∞) equals the left hand side of this
inequality. We show that the right hand side is less than or equal to
semdist_1(d, g, F_∞), and from this it would follow that
syndist(d, g, F_∞) ≤ semdist_1(d, g, F_∞).
Let G_1 = L_{n-1}(g_1) and G = L_n(g). From the definition of L_n(g),
we see that G = {expn(δ)t : δ ∈ Δ and t ∈ G_1}. By definition,
semdist_1(d[1], g_1, F_∞) = min{dist(expn(d[1]), u, F_∞) : u ∈ G_1} and
semdist_1(d, g, F_∞) = min{dist(expn(d), u, F_∞) : u ∈ G}. Let s ∈ G_1 be
the string such that
dist(expn(d[1]), s, F_∞) = min{dist(expn(d[1]), u, F_∞) : u ∈ G_1}.
Now, it is not difficult to see that
semdist_1(d, g, F_∞) = min{dist(expn(d), expn(δ)s, F_∞) : δ ∈ Δ} and hence
semdist_1(d, g, F_∞) ≥ semdist_1(d[1], g_1, F_∞). □

It is to be noted that syndist(d, f, F_i) ≤ syndist(d, f, F_j) for all i, j
such that 1 ≤ i ≤ j ≤ ∞. Also semdist_2(d, f, F_∞) ≤ semdist_1(d, f, F_∞),
and semdist_1(d, f, F_i) ≤ semdist_1(d, f, F_j) and
semdist_2(d, f, F_i) ≤ semdist_2(d, f, F_j) for all i, j such that
i ≤ j ≤ ∞. These results follow directly from Lemma 2.1. The only known
non-trivial relationship between syntactic and semantic distances for
temporal formulas is the one given by Lemma 4.1. For example, in general, we
believe that neither the relation
syndist(d, f, F_∞) ≤ semdist_2(d, f, F_∞) nor the reverse relationship holds.
Similarly, in general, for i < ∞, we cannot relate syndist(d, f, F_i) with
either semdist_1(d, f, F_i) or semdist_2(d, f, F_i).

Algorithm for computing the Syntactic distance 
Now we present algorithms for computing the syntactic and the seman- 
tic distances. First we present the algorithms for computing the syntactic 
distances. 

LEMMA 4.2: Given a database sequence d = (d_0, ..., d_{n-1}) and a TL
formula f, and given the similarity values of the database states in d with
respect to the atomic propositions appearing in f, there exists an algorithm
that computes syndist(d, f, F_∞) in time O(n · length(f)). For each k,
1 ≤ k < ∞, there exists an algorithm that computes syndist(d, f, F_k) in time
O(n² · length(f)).

Proof: Let d, f be as given in the lemma. For each i, 0 ≤ i < n, and
for each g which is an atomic proposition or its negation, let simval(d_i, g)
be the similarity value of g in the database state d_i. Let SF(f) be the set
of all subformulas of f. Let k be any integer such that 1 ≤ k ≤ ∞. For
each g ∈ SF(f) and for each i = 0, ..., n − 1, we compute syndist(d[i], g, F_k)
inductively on the length of g as follows. The algorithm for computing
syndist(d[i], g, F_k) is the same for all k in all cases except the case
when g is of the form g_1 Until g_2.

• When g is an atomic proposition or its negation,
syndist(d[i], g, F_k) = 1 − simval(d_i, g).

• When g = g_1 ∧ g_2, we compute syndist(d[i], g, F_k) to be the maximum
of syndist(d[i], g_1, F_k) and syndist(d[i], g_2, F_k).

• When g = g_1 ∨ g_2, we compute syndist(d[i], g, F_k) to be the
minimum of syndist(d[i], g_1, F_k) and syndist(d[i], g_2, F_k).

• When g = Nexttime g_1, syndist(d[i], g, F_k) is taken to be 1 for
i = n − 1, and it is taken to be syndist(d[i + 1], g_1, F_k) for i < n − 1.

• For the case when g = g_1 Until g_2 we do as follows. We compute
syndist(d[i], g, F_k) for decreasing values of i. We first give the method
for the case when k = ∞. The value of syndist(d[n − 1], g, F_∞) is computed
to be syndist(d[n − 1], g_2, F_∞). For i < n − 1, syndist(d[i], g, F_∞) is
computed to be the minimum of the two values syndist(d[i], g_2, F_∞) and
max{syndist(d[i], g_1, F_∞), syndist(d[i + 1], g, F_∞)}. It is easy to see
that this procedure only takes O(n) time.

For k ≠ ∞, we compute the values {syndist(d[i], g, F_k) : 0 ≤ i < n}
from the values {syndist(d[i], g_1, F_k), syndist(d[i], g_2, F_k) : 0 ≤ i < n}
as follows. Let sum_{i,j} be the sum of (syndist(d[r], g_1, F_k))^k for all
values of r such that i ≤ r < j. Let

y_{i,j} = ( (sum_{i,j} + (syndist(d[j], g_2, F_k))^k) / (j − i + 1) )^(1/k).

We compute syndist(d[i], g, F_k) to be the minimum of the values y_{i,j} for
j = i, ..., n − 1. From the definitions it is not difficult to see that this
procedure correctly computes syndist(d[i], g, F_k). It is not difficult to
see that, for a fixed i, all the values of sum_{i,j} and y_{i,j} for
j = i, ..., n − 1 can be computed in time O(n). Thus this step for computing
syndist(d[i], g, F_k) takes O(n) time. Computing all the values in
{syndist(d[i], g, F_k) : 0 ≤ i < n} takes O(n²) time.

It is easy to see that the above algorithm correctly computes the syntactic
distances. The complexity of the algorithm is O(length(d) · length(f)) for
k = ∞, and for all cases when k < ∞ the complexity of the algorithm is
O(length(d)² · length(f)). □
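The procedure in the proof can be sketched as a bottom-up table computation: for each subformula, one list holds its distance at every suffix d[i]. The sketch below makes several assumptions that go beyond the text: formulas are nested tuples, states are dictionaries of similarity values, negation is computed as one minus the distance of its argument (as in the inductive definition), and F_k is the length-normalised k-norm. The parameter next_at_end is hypothetical; it defaults to the value 1 used in the proof for Nexttime at the last state, and can be set to infinity to match the inductive definition of syndist.

```python
import math

def syndist_table(d, f, k=math.inf, next_at_end=1.0):
    """Return a list t with t[i] = syndist(d[i], f, F_k) for every suffix of d."""
    n = len(d)
    op = f[0]
    if op == "atom":
        return [1.0 - d[i][f[1]] for i in range(n)]
    if op == "not":
        return [1.0 - x for x in syndist_table(d, f[1], k, next_at_end)]
    if op in ("and", "or"):
        a = syndist_table(d, f[1], k, next_at_end)
        b = syndist_table(d, f[2], k, next_at_end)
        pick = max if op == "and" else min
        return [pick(x, y) for x, y in zip(a, b)]
    if op == "next":
        a = syndist_table(d, f[1], k, next_at_end)
        return a[1:] + [next_at_end]
    if op == "until":
        a = syndist_table(d, f[1], k, next_at_end)  # distances for g1
        b = syndist_table(d, f[2], k, next_at_end)  # distances for g2
        out = [0.0] * n
        if k == math.inf:
            # Backward recurrence: O(n) for this operator.
            out[n - 1] = b[n - 1]
            for i in range(n - 2, -1, -1):
                out[i] = min(b[i], max(a[i], out[i + 1]))
        else:
            # For each start i, scan j = i..n-1 with a running sum: O(n^2).
            for i in range(n):
                running = 0.0  # sum of a[r]**k for i <= r < j
                best = math.inf
                for j in range(i, n):
                    y = ((running + abs(b[j]) ** k) / (j - i + 1)) ** (1.0 / k)
                    best = min(best, y)
                    running += abs(a[j]) ** k
                out[i] = best
        return out
    raise ValueError("unknown operator: %r" % (op,))
```

The distance of the whole sequence is the first entry, syndist_table(d, f, k)[0]. Each operator node is processed once, which gives the O(n · length(f)) and O(n² · length(f)) bounds when every subformula occurs once in the tuple.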

Computing the Semantic Distances 

Let d = (d_0, d_1, ..., d_{n-1}) be a database sequence and f be a TL formula.
We take an automata theoretic approach for computing semdist_k(d, f, F_i)
for i = 1, ..., ∞. For this we use the well known result (see [VWS83, ES83])
that shows that there exists an automaton A such that |^| < and 
L(A) = L(f). The value of semdist_1(d, f, F_i) is computed as the value
distance_1(d, A, F_i) using the algorithm given in section 3.

To compute the value of semdist_2(d, f, F_i) also, we use the approach
given in section 3. This approach uses the complement Ā of the automaton
A. The automaton Ā accepts the complement of the language L(A); this is
exactly the set of sequences that satisfy the formula ¬f. Thus we can take
Ā to be the automaton that accepts the set of all sequences that satisfy
¬f. Using the approach given in [VWS83, ES83] we can
obtain such an automaton whose number of states is 

automaton, we can apply the procedure given in section 3. The complexity 
of the resulting algorithm will be 0{length{d) ■ '''"""'^■''^ ) for the case when 
we use the distance function Foe,; for all other distance functions F^ {i < cxd), 
the complexity is 0{length{dY ■ ) _ 

5 Regular Expressions 

In this section we consider regular expressions (REs) as query languages and
define syntactic and semantic distances of a database sequence with respect
to REs. Let Δ be a finite set of atomic queries. The set of REs over Δ is the
smallest set of strings satisfying the following conditions. Every element
of Δ is a RE; if g and h are REs then (g ∨ h), gh and (g)* are also REs. With
each RE f over Δ, we associate a language L(f) over Δ defined inductively as
follows. For every a ∈ Δ, L(a) = {a}. For the REs g, h, L(gh) = L(g)L(h),
L(g ∨ h) = L(g) ∪ L(h), and L((g)*) = (L(g))*.

Let d = (d_0, ..., d_{n-1}) be a database sequence and f be a RE. For each
k = 1, ..., ∞ we define a syntactic distance function, denoted by
syndist(d, f, F_k), inductively on the structure of f as follows. First we
define this for the case when k ≠ ∞.

• For a ∈ Δ, syndist(d, a, F_k) = 1 − simval(d_0, a) when n = 1, i.e.,
length(d) = 1; otherwise syndist(d, a, F_k) = ∞.

• syndist(d, g ∨ h, F_k) = min{syndist(d, g, F_k), syndist(d, h, F_k)}.

• syndist(d, gh, F_k) =
min{ ((length(α) · u^k + length(β) · v^k) / length(d))^(1/k) : α, β are,
possibly null, database sequences such that αβ = d and u = syndist(α, g, F_k)
and v = syndist(β, h, F_k)}.

• We define syndist(d, (g)*, F_k) as follows. If d is the null string
then syndist(d, (g)*, F_k) = 0; otherwise, syndist(d, (g)*, F_k) is
min{ ((Σ_{1≤i≤l} length(α_i) · u_i^k) / length(d))^(1/k) : there exist
α_1, ..., α_l such that α_1α_2...α_l = d and, for 1 ≤ i ≤ l,
α_i is non-null and u_i = syndist(α_i, g, F_k)}.

The values of syndist(d, f, F_∞) are defined as follows.

• For a ∈ Δ, syndist(d, a, F_∞) = 1 − simval(d_0, a) when n = 1, i.e.,
length(d) = 1; otherwise syndist(d, a, F_∞) = ∞.

• For the RE g ∨ h,
syndist(d, g ∨ h, F_∞) = min{syndist(d, g, F_∞), syndist(d, h, F_∞)}.

• syndist(d, gh, F_∞) = min{max{syndist(α, g, F_∞), syndist(β, h, F_∞)} :
α, β are database sequences such that αβ = d}.

• We define syndist(d, (g)*, F_∞) to be
min{max{syndist(α_1, g, F_∞), ..., syndist(α_l, g, F_∞)} : α_1α_2...α_l = d
and each α_i is non-null}.
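The definitions for regular expressions can be evaluated by memoising over segments d[i:j] of the database sequence. The following sketch is only one possible way to do this and relies on assumptions not fixed by the text: REs are nested tuples ("sym", "or", "cat", "star"), states are dictionaries of similarity values, a split containing a piece at infinite distance is treated as infeasible, and the distance of the null sequence is 0 whenever all pieces have finite distance (including for (g)* under F_∞). The function name is hypothetical.

```python
import math
from functools import lru_cache

def re_syndist(d, f, k=math.inf):
    """Syntactic distance of database sequence d with respect to the RE f."""

    def combine(parts):
        # parts: (length, distance) pairs for the pieces of one split.
        if any(u == math.inf for _, u in parts):
            return math.inf
        if k == math.inf:
            return max(u for _, u in parts)
        total = sum(length for length, _ in parts)
        if total == 0:
            return 0.0
        return (sum(length * u ** k for length, u in parts) / total) ** (1.0 / k)

    def star(i, j, g):
        # best[p]: cheapest aggregate for splitting d[i:p] into non-null pieces.
        best = {i: 0.0}
        for p in range(i + 1, j + 1):
            candidates = []
            for q in range(i, p):
                u = dist(q, p, g)
                if k == math.inf:
                    candidates.append(max(best[q], u))
                else:
                    candidates.append(best[q] + (p - q) * u ** k)
            best[p] = min(candidates)
        if k == math.inf or best[j] == math.inf:
            return best[j]
        return (best[j] / (j - i)) ** (1.0 / k)

    @lru_cache(maxsize=None)
    def dist(i, j, g):
        """Distance of the segment d[i:j] with respect to g."""
        op = g[0]
        if op == "sym":
            return 1.0 - d[i][g[1]] if j - i == 1 else math.inf
        if op == "or":
            return min(dist(i, j, g[1]), dist(i, j, g[2]))
        if op == "cat":
            return min(
                combine([(m - i, dist(i, m, g[1])), (j - m, dist(m, j, g[2]))])
                for m in range(i, j + 1)
            )
        if op == "star":
            return 0.0 if i == j else star(i, j, g[1])
        raise ValueError("unknown operator: %r" % (op,))

    return dist(0, len(d), f)
```

For finite k the denominator length(d) is the same for every split, so the star case minimises the weighted sum of k-th powers and applies the normalisation and root once at the end.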

For each k = 1, ..., ∞, we define two semantic distance functions semdist_1
and semdist_2 as follows. For any database sequence d and RE f and for
j = 1, 2, semdist_j(d, f, F_k) = distance_j(d, L(f), F_k). The following lemma
can be easily proven. It shows that the syntactic distance and the semantic
distance given by semdist_1 are identical.

LEMMA 5.1: For each database sequence d and RE f and for each
k = 1, ..., ∞, syndist(d, f, F_k) = semdist_1(d, f, F_k).

Proof: We give the proof for the case when k < ∞. The proof
is similar for the case k = ∞. The lemma is proved by induction on
the length of f. In the base case, the length of f is one and f = a
for some a ∈ Δ. From the definition of the two distance measures, it is
easy to see that syndist(d, f, F_k) = semdist_1(d, f, F_k). As an induction
hypothesis, assume that the lemma is true for all f of length less
than or equal to r. Now consider a RE f whose length is r + 1. We
consider the different cases. The first case is when f is of the form
g ∨ h. In this case, from the definitions, we have
semdist_1(d, g ∨ h, F_k) = distance_1(d, L(g ∨ h), F_k), which equals
distance_1(d, L(g) ∪ L(h), F_k), which is
min{distance_1(d, L(g), F_k), distance_1(d, L(h), F_k)}. The latter is
min{semdist_1(d, g, F_k), semdist_1(d, h, F_k)} and this by induction equals
min{syndist(d, g, F_k), syndist(d, h, F_k)}, which is syndist(d, f, F_k).

Now consider the case when f = gh. We have semdist_1(d, gh, F_k) =
distance_1(d, L(g)L(h), F_k). The latter is
min{distance_1(d, αβ, F_k) : α ∈ L(g), β ∈ L(h)}. This value is given by

(*) min{distance_1(d_1d_2, αβ, F_k) : α ∈ L(g), β ∈ L(h),
length(d_1) = length(α), d_1d_2 = d}.

It is easy to see that, for database sequences d_1, d_2 such that d = d_1d_2,
length(α) = length(d_1) and length(d) = length(αβ),

distance_1(d_1d_2, αβ, F_k) =
((length(d_1) · u^k + length(d_2) · v^k) / length(d))^(1/k),

where u = distance_1(d_1, α, F_k) and v = distance_1(d_2, β, F_k). Let E
denote the expression
((length(d_1) · u^k + length(d_2) · v^k) / length(d))^(1/k). Substituting this
in (*), we get semdist_1(d, gh, F_k) = min{E : α ∈ L(g), β ∈ L(h),
length(d_1) = length(α), d_1d_2 = d}. Since we are taking the minimum on the
right hand side, it is not difficult to see that we can choose α to be the
one that gives the minimum value for u, and this minimum value of u is
semdist_1(d_1, g, F_k). Similarly, we take v to be semdist_1(d_2, h, F_k). Thus
we get semdist_1(d, gh, F_k) = min{E : u = semdist_1(d_1, g, F_k),
v = semdist_1(d_2, h, F_k), d_1d_2 = d}. Note that here we take the minimum
over all d_1, d_2 such that d_1d_2 = d. (It is to be noted that there may be
combinations of d_1, d_2 for which there may not be strings of length
length(d_1) in L(g) or strings of length length(d_2) in L(h); in these cases,
it is easy to see that either u = ∞ or v = ∞ respectively and hence E = ∞.
Hence these additional combinations do not change the minimum.) Using the
induction hypothesis, we have semdist_1(d_1, g, F_k) = syndist(d_1, g, F_k) and
semdist_1(d_2, h, F_k) = syndist(d_2, h, F_k). Using this, we have
semdist_1(d, gh, F_k) = min{E : u = syndist(d_1, g, F_k),
v = syndist(d_2, h, F_k), d_1d_2 = d}. From the definitions, we
see that the right hand side is syndist(d, gh, F_k). The proof of the induction
step for the case when f = (g)* is similar and is left to the reader. □
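The identity for concatenated sequences used in the proof can be checked directly if one assumes that dist(d, a, F_k) is the length-normalised k-norm of the values 1 − simval(d_i, a_i); that definition belongs to section 2 and is taken here as an assumption. Writing n_1 = length(d_1), n_2 = length(d_2), n = n_1 + n_2 with n_1, n_2 > 0, and c_i = 1 − simval(d_i, (αβ)_i):

```latex
\begin{align*}
\mathit{dist}(d_1 d_2, \alpha\beta, F_k)^k
  &= \frac{1}{n} \sum_{i=0}^{n-1} c_i^{\,k} \\
  &= \frac{1}{n} \Bigl( n_1 \cdot \frac{1}{n_1} \sum_{i=0}^{n_1-1} c_i^{\,k}
       \;+\; n_2 \cdot \frac{1}{n_2} \sum_{i=n_1}^{n-1} c_i^{\,k} \Bigr) \\
  &= \frac{n_1\, u^k + n_2\, v^k}{n},
  \qquad u = \mathit{dist}(d_1, \alpha, F_k),\quad v = \mathit{dist}(d_2, \beta, F_k).
\end{align*}
```

Taking k-th roots on both sides gives the expression E. Since E is non-decreasing in both u and v, minimising u and v independently for a fixed split d_1 d_2 minimises E.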

For a RE f, let A(f) be a standard non-deterministic automaton that
accepts L(f) and such that the size of A(f) is linear in length(f) (see
[LP98]). The values of semdist_1(d, f, F_k) for each k = 1, ..., ∞ can be
computed by constructing the automaton A(f) (possibly non-deterministic) and
using the algorithm given in [HS00]. These algorithms are of complexity
O(length(d) · length(f)).
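One way to obtain a bound of this shape is a layered shortest-path computation over the automaton. The sketch below is not the algorithm of the cited work, which is not reproduced in this section; it is an illustration under stated assumptions: the automaton has no empty transitions, only accepted strings of the same length as d are compared with d, and F_k is the length-normalised k-norm. The representation of the automaton (sets of states and a dictionary of labelled transitions) and the function name are hypothetical.

```python
import math

def semdist1_nfa(d, initial, accepting, transitions, k=math.inf):
    """Minimum distance between d and any string of length len(d) accepted by the NFA.

    d: list of dicts mapping atomic queries to similarity values.
    transitions: dict mapping a state to a list of (atomic_query, next_state).
    """
    n = len(d)
    # cost[q]: cheapest aggregate (max, or sum of k-th powers) over all ways
    # of reaching state q after reading the states of d seen so far.
    cost = {q: 0.0 for q in initial}
    for state in d:
        nxt = {}
        for q, c in cost.items():
            for symbol, r in transitions.get(q, []):
                step = 1.0 - state[symbol]
                new = max(c, step) if k == math.inf else c + step ** k
                if new < nxt.get(r, math.inf):
                    nxt[r] = new
        cost = nxt
    best = min((c for q, c in cost.items() if q in accepting), default=math.inf)
    if k == math.inf or best == math.inf or n == 0:
        return best
    return (best / n) ** (1.0 / k)
```

Each database state is processed once against every transition, so the running time is proportional to length(d) times the number of transitions. On small inputs the result can be compared against the syntactic distance of the corresponding RE, which Lemma 5.1 states to be equal.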

Given a database sequence d and RE f, semdist_2(d, f, F_k) is computed
exactly on the same lines as given in section 3. First we obtain the automaton
Ā that accepts all strings in Δ* − L(f). The size of the resulting automaton
will be O(2^length(f)). The remainder of the steps is the same as given in
section 3. As before, the complexity of the algorithm is triple exponential
in length(f) but linear in length(d). Because of this complexity, it might
be better to use the syntactic distance measure for similarity based retrieval.
Note that, by Lemma 5.1, this distance function is also the same as the first
semantic distance function semdist_1.

6 Related Work 

There have been various formalisms for representing uncertainty (see [Ha03])
such as probability measures, Dempster-Shafer belief functions, plausibility
measures, etc. Our similarity measures for temporal logics and automata
can possibly be categorized under plausibility measures, and they are quite
different from probability measures. The book [Ha03] also describes logics
for reasoning about uncertainty. Also, probabilistic versions of Propositional
Dynamic Logics were presented in [Ko83]. However, these works do not
consider logics and formalisms on sequences, and do not use the various
vector distance measures considered in this paper.

Since the appearance of a preliminary version of this paper [Si02], other
non-probabilistic quantitative versions of temporal logic have been proposed
in [Al04, Al03]. Both these works consider infinite computations and
branching time temporal logics. The similarity measure they give, for the
linear time fragment of their logic, corresponds to the infinite norm among
the vector distance functions. In contrast, we consider formalisms and logics
on finite sequences and give similarity based measures that use a spectrum of
vector distance measures. We also present methods for computing similarity
values of a database sequence with respect to queries given in the different
formalisms.

There has been much work done on querying from time-series and other 
sequence databases. For example, methods for similarity based retrieval from 
such databases have been proposed in [FRM94, AFS93, ALSS95, B97, RM97, 
NRS99]. These methods assume that the query is also a single sequence, not 
a predicate on sequences as we consider here. 

There has also been much work done on data-mining over time series data
[AS94, GRS99] and other databases. These works mostly consider discovery
of patterns that have a given minimum level of support. They do not consider
similarity based retrieval.

A temporal query language and efficient algorithms for similarity based
retrieval have been presented in [SYV97]. That work uses a syntactic
distance measure which is ad hoc. In contrast, in this work, we consider
syntactic as well as semantic distance measures. Further, in this paper, we
consider a spectrum of these measures based on well accepted standard norm
distance measures on vectors.

There has been work done on approximate pattern matching (see [WM92]
for references) based on regular expressions. These works use different
distance measures. For example, they usually use the edit distance as a
measure and look for patterns defined by a given regular expression within a
given edit distance. On the other hand, we consider average measures; for
example, the distance function F_1 defines average block distance. In the
area of bio-informatics much work has been done on sequence matching (see
[D98] for references). Most of this work is based on probabilistic models
(such as Markov or extended Markov models). They do not employ techniques
based on indices for the subsequence search.

Predicates on sequences have been employed in specifying triggers in Active
Database Management Systems [C89, D88, GJS92, SW95a]. However,
there, exact semantics is used for firing and processing the triggers.

A lot of work on fuzzy logic considers assignment of similarity values to
propositional formulae based on their syntax. However, to the best of our
knowledge, no other work has been done for logics on sequences.

7 Conclusions and Discussion 

In this paper, we have considered languages based on automata, temporal
logic and regular expressions for specifying queries over sequence databases.
We have defined a variety of distance measures, based on the syntax and
semantics of the queries. We have outlined algorithms for computing these
values. The algorithms for computing syntactic distance measures are of
polynomial time complexity in the length of the query and in the length of
the database sequence. The algorithms for computing the first semantic
distance measure have lower complexity than those for the second semantic
distance measure. Thus, from the complexity point of view, it might be better
to use the syntactic measures or the first semantic distance measure.
Some of the algorithms for automata have been implemented and tested on real
data (see [HS00] for details).

It is to be noted that when we defined, in section 2, the distance
dist(d, a, F_k) between a database sequence d = (d_0, ..., d_{n-1}) and a
sequence a = (a_0, ..., a_{n-1}) of atomic queries taken from Δ, we assumed
that we are given the values simval(d_i, a_i) for each i = 0, ..., n − 1. We
also assumed that these values lie in the range [0, 1]. If we require that
some atomic query, say δ, should be exactly satisfied, then giving a
similarity value of either 0 or 1 may not achieve this purpose. Suppose that
a_0 = δ and d_0 does not satisfy a_0; then setting simval(d_0, δ) = 0, and
hence setting dist(d_0, δ) = 1, will not serve the purpose, since F_k for
k < ∞ will aggregate these values and the value of dist(d, a) may be much
smaller than 1 if other database states in the sequence satisfy the
corresponding atomic queries with similarity value 1. (Of course, if F_∞ is
used then this is not a problem.) To achieve what we want, we need to set
simval(d_0, δ) to be −∞. In this case, dist(d, a, F_k) will be ∞ for every
k > 0. Thus for those atomic queries which need to be exactly satisfied, we
can define the similarity value of a database state with respect to these to
be either −∞ or 1, denoting no satisfaction and perfect satisfaction
respectively; note that the corresponding distance values will be ∞ or 0
respectively. Thus we can partition the set of atomic queries into two sets:
those that need to be exactly satisfied, for which the similarity values
given are either −∞ or 1, and the remaining ones, for which the similarity
values are given from the interval [0, 1]. It is not difficult to see that
this scheme would work for the syntactic distance measures defined in
sections 4 and 5.
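The effect described above can be seen on a small numeric example. The sketch assumes that dist(d, a, F_k) is the length-normalised k-norm of the values 1 − simval(d_i, a_i), with the maximum for k = ∞; the function name and the sample similarity values are hypothetical.

```python
import math

def dist(simvals, k):
    """Assumed form of dist(d, a, F_k), given the per-position similarity values."""
    gaps = [abs(1.0 - s) for s in simvals]
    if k == math.inf:
        return max(gaps)
    return (sum(g ** k for g in gaps) / len(gaps)) ** (1.0 / k)

# The first atomic query is not satisfied; the other three are fully satisfied.
scored_zero = [0.0, 1.0, 1.0, 1.0]         # unsatisfied query scored 0
scored_minus_inf = [-math.inf, 1.0, 1.0, 1.0]  # unsatisfied query scored -infinity

print(dist(scored_zero, 1))         # averaging hides the violation
print(dist(scored_zero, math.inf))  # the maximum norm exposes it
print(dist(scored_minus_inf, 1))    # infinite for every finite k as well
```

With the score 0 the F_1 distance is only a quarter of the maximum, whereas with the score −∞ the distance is infinite under every F_k, so the sequence can never be reported as an approximate match.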

It is to be noted that all the distance measures that we defined are based
on norm vector distance functions. We feel these vector distance functions
are the most appropriate for the applications mentioned earlier in the paper.
On the other hand, other distance functions between sequences may be
appropriate for other applications. For example, the edit distance may be
appropriate in applications involving bio-informatics. This needs further
investigation as part of future work.